In [27]:
    
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
sns.set_style('white')
%matplotlib inline 
import warnings
warnings.filterwarnings("ignore")
    
Remember that if you want to experiment with visualization tools, you can use fake data as well as real data. Fake data is actually a nice way to experiment because you control every aspect of it. Let's create some random numbers.
The function np.random.randn() generates a sample of size $N$ from the standard normal distribution.
In [28]:
    
print(np.random.randn(10))
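
The names rand() and randn() are easy to confuse: np.random.rand() draws from the uniform distribution on [0, 1), while np.random.randn() draws from the standard normal distribution. The following is a minimal sketch of how one might check the difference numerically. The seed and the sample size are arbitrary choices made for this illustration only.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, used only to make this sketch repeatable

uniform_sample = np.random.rand(100_000)   # uniform on [0, 1)
normal_sample = np.random.randn(100_000)   # standard normal: mean 0, sigma 1

# The uniform sample never leaves [0, 1); the normal sample is centered on 0.
print(uniform_sample.min(), uniform_sample.max())
print(normal_sample.mean(), normal_sample.std())
```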
    
    
The following small function generates $N$ normally distributed numbers:
In [29]:
    
def generate_many_numbers(N=10, mean=5, sigma=3):
    return mean + sigma * np.random.randn(N)
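
Shifting and scaling standard normal draws gives a normal distribution with the requested mean and sigma. As a rough sanity check (a sketch, not part of the original exercise), one can draw a large sample and compare its sample mean and standard deviation with the requested values. The seed and the sample size below are arbitrary assumptions.

```python
import numpy as np

def generate_many_numbers(N=10, mean=5, sigma=3):
    # Same function as above: shift and scale standard normal draws.
    return mean + sigma * np.random.randn(N)

np.random.seed(1)  # arbitrary seed for repeatability
big_sample = generate_many_numbers(N=200_000)

# With this many draws, both values should land close to 5 and 3.
print(big_sample.mean(), big_sample.std())
```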
    
Generate 10 normally distributed numbers with mean 5 and sigma 3:
In [30]:
    
data = generate_many_numbers(N=10)
print(data)
    
    
The most direct way to visualize 1-D data is simply to plot it. Here we can use the scatter() function to draw a scatter plot. Its most basic usage is to provide x and y values.
In [31]:
    
x = np.arange(1,11)
y = x + 5
print(x)
print(y)
plt.scatter(x, y)
    
    
    Out[31]:
    
But here we only have x (the generated data), so we can set all the y values to 0. The np.zeros_like(data) function creates a NumPy array of zeros with the same shape as its argument.
In [32]:
    
print(np.zeros_like(data))
    
    
Now let's plot the generated 1-D data.
In [33]:
    
plt.figure(figsize=(10,1)) # set figure size, width = 10, height = 1
plt.scatter(data, np.zeros_like(data), s=50) # set size of symbols to 50. Change it and see what happens. 
plt.gca().axes.get_yaxis().set_visible(False) # set y axis invisible
    
    
Ok, I think we can see all data points. But what if we have more numbers?
In [34]:
    
# TODO: generate 100 numbers and plot them in the same way. 
data = generate_many_numbers(N=100)
plt.figure(figsize=(10,1))
plt.scatter(data, np.zeros_like(data), s = 50) 
plt.gca().axes.get_yaxis().set_visible(False)
    
    
Of course, we can't see much at the center, where the points overlap. We can add "jitter" by giving each point a random y position with the np.random.rand() function.
In [35]:
    
data = generate_many_numbers(N=100)
# Create an array of 100 random y positions using np.random.rand()
jittered_ypos = np.random.rand(100)
plt.figure(figsize=(10,1))
plt.scatter(data, jittered_ypos, s=50)
plt.gca().axes.get_yaxis().set_visible(False)
    
    
Let's also make the symbols transparent. The documentation of scatter() describes the alpha parameter, which controls transparency.
In [36]:
    
data = generate_many_numbers(N=200)
# Reuse the jittering approach from the last question
jittered_ypos = np.random.rand(200)
plt.figure(figsize=(10,1))
plt.scatter(data, jittered_ypos, s=50, alpha=0.35)  # alpha sets the opacity of the symbols
plt.gca().axes.get_yaxis().set_visible(False)
    
    
We can combine transparency with empty (unfilled) symbols.
In [37]:
    
data = np.random.rand(1000)
jittered_ypos = np.random.rand(1000)
plt.figure(figsize=(10,1))
# facecolors='none' leaves the symbols unfilled; alpha makes the outlines transparent
plt.scatter(data, jittered_ypos, s=50, facecolors='none', edgecolors='r', alpha=0.35)
plt.gca().axes.get_yaxis().set_visible(False)
    
    
Let's use real data. Load the IMDb dataset that we used before.
In [38]:
    
movie_df = pd.read_csv('imdb.csv', delimiter='\t')
movie_df.head()
    
    Out[38]:
Try to plot the 'Rating' information using a 1-D scatter plot. Does it work?
In [39]:
    
# TODO: plot 'rating'
rating = movie_df['Rating'].values
plt.figure(figsize=(10,1)) 
plt.scatter(rating, np.zeros_like(rating), s = 50) 
plt.gca().axes.get_yaxis().set_visible(False)
    
    
There are too many data points! Let's try a histogram. pandas supports plotting through matplotlib, so you can visualize DataFrames and Series directly.
In [40]:
    
movie_df['Rating'].hist()
    
    Out[40]:
    
Looks good! Can you increase or decrease the number of bins? The bins parameter, described in the documentation of hist(), controls this.
In [41]:
    
# TODO: try different number of bins
movie_df['Rating'].hist(bins = 30)
    
    Out[41]:
    
In [42]:
    
movie_df['Rating'].hist(bins = 20)
    
    Out[42]:
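
To see what changing the number of bins does numerically, the sketch below uses np.histogram(), which returns the counts and bin edges without drawing anything. The array fake_ratings is a hypothetical stand-in for the 'Rating' column (synthetic values, not the IMDb data), and the seed is arbitrary.

```python
import numpy as np

np.random.seed(2)  # arbitrary seed for repeatability
# Hypothetical stand-in for the ratings: synthetic values clipped to the range 1..10
fake_ratings = np.clip(6.5 + 1.2 * np.random.randn(5000), 1, 10)

for n_bins in (10, 20, 30):
    counts, edges = np.histogram(fake_ratings, bins=n_bins)
    bin_width = edges[1] - edges[0]
    # More bins means narrower bins and fewer points in the tallest bin.
    print(n_bins, round(bin_width, 3), counts.max())
```

Every data point falls into exactly one bin, so the counts always add up to the sample size. Only the resolution of the picture changes.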
    
Now let's try a boxplot. We can use pandas' plotting functions, whose documentation describes the available boxplot options.
In [43]:
    
movie_df['Rating'].plot(kind='box', vert=False)
    
    Out[43]:
    
Or try seaborn's boxplot() function:
In [44]:
    
sns.boxplot(x=movie_df['Rating'])
    
    Out[44]:
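
It can help to know which numbers a boxplot encodes. The sketch below computes them by hand with NumPy, assuming the common convention (matplotlib's default) that whiskers extend to the most extreme data points within 1.5 times the interquartile range of the box. The array fake_ratings is again a hypothetical stand-in for the real ratings.

```python
import numpy as np

np.random.seed(3)  # arbitrary seed for repeatability
# Hypothetical stand-in for the ratings: synthetic values clipped to the range 1..10
fake_ratings = np.clip(6.5 + 1.2 * np.random.randn(5000), 1, 10)

# The box spans the first to third quartile; the line inside it is the median.
q1, median, q3 = np.percentile(fake_ratings, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points that lie within 1.5 * IQR of the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = fake_ratings[(fake_ratings >= lower_limit) & (fake_ratings <= upper_limit)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Anything beyond the limits is drawn as an individual outlier point.
outliers = fake_ratings[(fake_ratings < lower_limit) | (fake_ratings > upper_limit)]
print(q1, median, q3, lower_whisker, upper_whisker, len(outliers))
```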
    
We can also easily draw a series of boxplots grouped by categories. For example, let's do the boxplots of movie ratings for different decades.
In [45]:
    
df = movie_df.sort_values('Year')  # DataFrame.sort() was removed from pandas; use sort_values()
df.head()
    
    Out[45]:
One easy way to convert a year to its decade (e.g., 1874 -> 1870) is to divide it by 10, discard the remainder, and multiply by 10 again.
In Python 3, the // operator performs integer (floor) division.
In [46]:
    
print(1874//10)
print(1874//10*10)
decade = (df['Year']//10) * 10
decade.head()
    
    
    Out[46]:
In [47]:
    
ax = sns.boxplot(x=decade, y=df['Rating'])
ax.figure.set_size_inches(12, 8)
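
The grouped boxplot has a numerical counterpart in pandas' groupby(). The sketch below builds a tiny hypothetical table (toy_df, with made-up years and ratings, not the IMDb data) and computes the median rating per decade, which corresponds to the center line of each box.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy_df = pd.DataFrame({
    'Year': [1874, 1879, 1931, 1938, 1999, 2003, 2005],
    'Rating': [5.2, 6.1, 7.0, 6.4, 8.1, 7.3, 6.8],
})

# Same decade trick as above: floor-divide by 10, then multiply by 10.
toy_decade = (toy_df['Year'] // 10) * 10

# Group the ratings by decade and take the median of each group.
medians = toy_df.groupby(toy_decade)['Rating'].median()
print(medians)
```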
    
    
Can you draw boxplots of movie votes for different decades?
In [48]:
    
# TODO
ax = sns.boxplot(x=decade, y=df['Votes'])
ax.figure.set_size_inches(12, 8)
    
    
What do you see? Can you actually see the "box"? The number of votes spans a very wide range, from 1 to more than 1.4 million. One way to deal with this is to log-transform the votes, which can be done with the numpy.log() function.
In [49]:
    
log_votes = np.log(df['Votes'])
log_votes.head()
    
    Out[49]:
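
To get a feel for how strongly the logarithm compresses a wide range, here is a small sketch on hypothetical vote counts (toy_votes, invented values spanning the range mentioned above). Note that numpy.log() is the natural logarithm and that log(0) is negative infinity, so this only works cleanly because every count is at least 1.

```python
import numpy as np

# Hypothetical vote counts spanning 1 to 1.4 million; not taken from the dataset.
toy_votes = np.array([1, 10, 250, 5_000, 120_000, 1_400_000])
log_votes_toy = np.log(toy_votes)  # natural logarithm, element-wise

# On the raw scale the largest value is over a million times the smallest;
# on the log scale the values span only about 14 units.
raw_ratio = toy_votes.max() / toy_votes.min()
log_span = log_votes_toy.max() - log_votes_toy.min()
print(raw_ratio, log_span)
```

If powers of ten are easier to read, np.log10() could be used instead; the shape of the resulting boxplots would be the same up to a constant scale factor.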
Can you draw boxplots of log-transformed movie votes for different decades?
In [50]:
    
# TODO
ax = sns.boxplot(x=decade, y = log_votes)
ax.figure.set_size_inches(12, 8)
    
    